Measuring Entropy Reduction: How much reduction in the entropy of X can we obtain by knowing Y?
$$
I(X;Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X) = H(X) + H(Y) - H(X,Y)
$$
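As a quick sanity check, here is a minimal Python sketch on a hypothetical 2×2 joint distribution (the `entropy` helper below is ours, not a library function) confirming that all three forms of the identity give the same value:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability cells are skipped."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
H_X_given_Y = H_XY - H_Y               # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X

print(H_X - H_X_given_Y)               # ~0.193 nats
print(H_Y - H_Y_given_X)               # same value
print(H_X + H_Y - H_XY)                # same value
```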
Properties
- Symmetric: I(X;Y) = I(Y;X)
- Non-negative: I(X;Y) ≥ 0
- I(X;Y) = 0 if and only if X and Y are independent (checked numerically in the sketch after this list)
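A small numeric check of these properties, again on a hypothetical 2×2 joint (the `mutual_information` helper is an illustrative sketch built on the entropy identity above):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability cells are skipped."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability table."""
    return (entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0))
            - entropy(p_xy.ravel()))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
# An independent joint is the outer product of its marginals, so its MI is 0.
independent = np.outer([0.5, 0.5], [0.7, 0.3])

print(mutual_information(dependent))    # ~0.193 > 0
print(mutual_information(dependent.T))  # identical: MI is symmetric in X and Y
print(mutual_information(independent))  # ~0.0 (up to floating-point error)
```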
Formally:
$$
I(X;Y) = \int_{Y}\int_{X} p(x,y)\,\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)dx\,dy
$$
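For discrete random variables the integrals become sums over the cells of the joint probability table, which is the form used in the numeric sketches here:

$$
I(X;Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x,y)\,\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)
$$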
Expressed in terms of the KL divergence:
$$
I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
$$
MI thus measures how far the actual joint distribution p(x,y) diverges from the product of marginals p(x)p(y), i.e. the joint distribution we would expect if X and Y were independent.
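This view gives a direct way to compute MI: take the KL divergence between the flattened joint table and the outer product of its marginals. A sketch using `scipy.stats.entropy`, which returns the KL divergence when given two distributions, on the same hypothetical joint as above:

```python
import numpy as np
from scipy.stats import entropy

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# p(x)p(y): the joint we would observe if X and Y were independent.
p_x_times_p_y = np.outer(p_x, p_y)

# scipy.stats.entropy(pk, qk) returns D_KL(pk || qk) in nats.
mi = entropy(p_xy.ravel(), p_x_times_p_y.ravel())
print(mi)   # ~0.193, the same value as H(X) + H(Y) - H(X,Y)
```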
Furthermore,
$$
\begin{aligned}
I(X;Y) &= D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) \\
       &= \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \\
       &= \iint p(x\mid y)\,p(y)\,\log\frac{p(x\mid y)}{p(x)}\,dx\,dy \\
       &= \int p(y)\left(\int p(x\mid y)\,\log\frac{p(x\mid y)}{p(x)}\,dx\right)dy \\
       &= \mathbb{E}_{Y}\!\left[\,D_{\mathrm{KL}}\big(p(x\mid y)\,\|\,p(x)\big)\,\right]
\end{aligned}
$$
Thus, MI can also be understood as the expectation, taken over Y, of the Kullback–Leibler divergence of the marginal distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more the distributions p(x|y) and p(x) differ on average, the greater the information gain.
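The same toy joint illustrates this last identity: condition on each value of Y, compute the KL divergence of p(x) from p(x|y), and average with weights p(y) (the `kl` helper below is ours):

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def kl(p, q):
    """D_KL(p || q) in nats, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# E_Y[ D_KL( p(x|y) || p(x) ) ]: average over y of the information gained about X.
mi = sum(p_y[j] * kl(p_xy[:, j] / p_y[j], p_x) for j in range(p_xy.shape[1]))
print(mi)   # ~0.193, matching the other formulations
```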